The Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave its credit card services, and the reasons why, so that the bank can improve in those areas.
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,  # plot_confusion_matrix was removed in scikit-learn 1.2
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To be used for label encoding (OneHotEncoder is already imported above)
from sklearn.preprocessing import LabelEncoder
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn import set_config
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframe floats
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
The necessary libraries have been imported.
thera = pd.read_csv("BankChurners.csv")
# checking number of rows and columns
thera.shape
(10127, 21)
The dataset contains 10,127 rows and 21 columns.
# make a copy of the original dataset to view data summary
df = thera.copy()
# view first 5 rows of data
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# view last 5 rows of data
df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
# let's check the data types of the columns in the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# let's check for duplicate values in the data
df.duplicated().sum()
0
There are no duplicated rows, but this could be because CLIENTNUM is a unique identifier. We will check for duplicates again after dropping CLIENTNUM.
# let's check for missing values in the data
round(df.isnull().sum() / df.isnull().count() * 100, 2)
CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
# Checking for the null value in the dataset
df.isna().sum()
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
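Although SimpleImputer is imported above, it is not applied in this section. A minimal sketch of how the two columns with missing values (Education_Level and Marital_Status) could be filled with their most frequent category — shown on a small hypothetical frame rather than the actual dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# small hypothetical frame standing in for the two columns with missing values
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
        "Marital_Status": ["Married", "Single", np.nan, "Married"],
    }
)

# fill each column's NaNs with that column's most frequent category
imputer = SimpleImputer(strategy="most_frequent")
toy[["Education_Level", "Marital_Status"]] = imputer.fit_transform(toy)
print(toy.isna().sum().sum())  # 0
```

Whether mode imputation is appropriate here (versus a separate "Unknown" category) is a modeling choice; fitting the imputer on training data only would avoid leakage.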
df.nunique()
CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 6
Marital_Status                  3
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
We can group the features into three broad categories:
- Relationship: parameters that indicate, or can influence, the bank's relationship with the customer.
- Credit parameters: credit limit, revolving balance, transactions, and related measures.
- Customer profile: demographic features such as age, gender, income, and education.
# let's view the statistical summary of the numerical columns in the data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
To proceed with EDA, let us first create lists of the numerical and categorical columns. This will help us group and present different views.
# create numerical columns list
num_cols = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# Making a list of all categorical variables
cat_col = [
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
"Attrition_Flag",
]
Checking the value counts for each category of the categorical variables
# Printing number of count of each unique value in each column
for column in cat_col:
print(df[column].value_counts(dropna=False))
print("-" * 40)
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         3128
High School      2013
NaN              1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     4687
Single      3943
NaN          749
Divorced     748
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
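One anomaly worth flagging in the output above: Income_Category contains 1,112 rows labeled "abc", which looks like a placeholder rather than a real income band. A hedged option (not applied in this notebook) is to convert it to a missing value so it can be handled alongside the other NaNs; a sketch on a toy series:

```python
import numpy as np
import pandas as pd

# toy series standing in for the Income_Category column
income = pd.Series(["Less than $40K", "abc", "$120K +", "abc"])

# treat the "abc" placeholder as missing so imputation/encoding can handle it
income_clean = income.replace("abc", np.nan)
print(income_clean.isna().sum())  # 2
```

Keeping "abc" as its own category is also defensible if the placeholder itself turns out to be predictive.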
Before we begin EDA, let us take the following steps to prepare the data:
# making a copy of the original dataframe
data = thera.copy()
# from the statistical summary, we understand that CLIENTNUM is a unique customer identifier with no duplicates
# so we can drop the CLIENTNUM column from the dataframe
data = data.drop(columns=["CLIENTNUM"])
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram (palette removed: it has no effect without hue)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
display(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False,
)
tab.plot(kind="bar", stacked=True, figsize=(count + 3, 3), cmap="coolwarm")
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # single legend, placed outside the plot
plt.show()
# defining bins for Customer_Age
bins_age = (20, 30, 40, 50, 60, 75)
# defining labels
labels_age = ["20-30", "30-40", "40-50", "50-60", "60-75"]
data["age_bin"] = pd.cut(x=data["Customer_Age"], bins=bins_age, labels=labels_age)
# defining bins for Months_on_book
bins_months = (0, 12, 24, 36, 48, 60)
# defining labels
labels_months = ["0-12", "12-24", "24-36", "36-48", "48-60"]
data["bin_months"] = pd.cut(
x=data["Months_on_book"], bins=bins_months, labels=labels_months
)
# defining bins for Total_Revolving_Bal
bins_bal = (0, 500, 1000, 1500, 2000, 2500)
# defining labels
labels_bal = ["0-500", "500-1000", "1000-1500", "1500-2000", "2000-2500"]
data["bin_bal"] = pd.cut(
x=data["Total_Revolving_Bal"], bins=bins_bal, labels=labels_bal
)
# defining bins for Total_Trans_Ct
bins_ct = (10, 50, 100, 150)
# defining labels
labels_ct = ["10-50", "50-100", "100-150"]
data["bin_ct"] = pd.cut(x=data["Total_Trans_Ct"], bins=bins_ct, labels=labels_ct)
# defining bins for Total_Trans_Amt
bins_amt = (500, 5000, 10000, 15000, 20000)
# defining labels
labels_amt = ["500-5000", "5000-10000", "10000-15000", "15000-20000"]
data["bin_amt"] = pd.cut(x=data["Total_Trans_Amt"], bins=bins_amt, labels=labels_amt)
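Note that pd.cut builds right-closed intervals by default, so values equal to the lowest edge (e.g. a Total_Revolving_Bal of exactly 0) or above the highest edge (the column's max is 2,517, above the 2,500 cap in bins_bal) fall outside every bin and become NaN. A quick sketch of that edge behavior:

```python
import pandas as pd

bins_bal = (0, 500, 1000, 1500, 2000, 2500)
labels_bal = ["0-500", "500-1000", "1000-1500", "1500-2000", "2000-2500"]

vals = pd.Series([0, 250, 2500, 2517])
binned = pd.cut(vals, bins=bins_bal, labels=labels_bal)
# 0 -> NaN (left edge excluded), 250 -> "0-500",
# 2500 -> "2000-2500" (right edge included), 2517 -> NaN (above the last edge)
print(binned.tolist())
```

Passing `include_lowest=True` or widening the last edge would keep those boundary rows in a bin, if that is preferred.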
We have already created a list of numerical and categorical features. We can break them down even further into meaningful lists.
# combining features that can indicate relationship with a bank
relationship = [
"bin_months",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
]
# combining credit related parameters
credit = [
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Total_Amt_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# combining customer profile features
cust_prof = [
"Gender",
"Income_Category",
"age_bin",
"Dependent_count",
"Education_Level",
"Marital_Status",
"Card_Category",
]
# plotting countplot for the binned Customer_Age column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="age_bin", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting histogram and boxplot for Customer_Age
histogram_boxplot(data, "Customer_Age")
# histograms and boxplots for Credit_Limit
histogram_boxplot(data, "Credit_Limit")
# histograms and boxplots for Avg_Open_To_Buy
histogram_boxplot(data, "Avg_Open_To_Buy")
# histograms and boxplots for Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1")
# histograms and boxplots for Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
# histograms and boxplots for Avg_Utilization Ratio
histogram_boxplot(data, "Avg_Utilization_Ratio")
# plotting countplot for the binned Total_Revolving_Bal column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="bin_bal", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# histograms and boxplots for Total_Revolving_Bal
histogram_boxplot(data, "Total_Revolving_Bal")
# plotting countplot for the binned Total_Trans_Amt column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="bin_amt", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# histograms and boxplots for Total_Trans_Amt
histogram_boxplot(data, "Total_Trans_Amt")
# plotting countplot for the binned Total_Trans_Ct column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="bin_ct", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# histograms and boxplots for Total_Trans_Ct
histogram_boxplot(data, "Total_Trans_Ct")
# plotting countplot for the Gender column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Gender", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Dependent_count column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Dependent_count", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Education_Level column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="Education_Level", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Marital_Status column
plt.figure(figsize=(7, 5))
ax = sns.countplot(data=data, x="Marital_Status", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Income_Category column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Income_Category", palette="viridis")
ax.bar_label(ax.containers[0])
plt.xticks(rotation=45)
plt.show()
# plotting countplot for the Card_Category column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Card_Category", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Attrition_Flag column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Attrition_Flag", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Months_on_book column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="bin_months", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Total_Relationship_Count column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Total_Relationship_Count", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Months_Inactive_12_mon column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Months_Inactive_12_mon", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plotting countplot for the Contacts_Count_12_mon column
plt.figure(figsize=(5, 5))
ax = sns.countplot(data=data, x="Contacts_Count_12_mon", palette="viridis")
ax.bar_label(ax.containers[0])
plt.show()
# plot Gender vs Attrition_Flag
stacked_barplot(data, "Gender", "Attrition_Flag")
| Attrition_Flag | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| Gender | |||
| All | 1627 | 8500 | 10127 |
| F | 930 | 4428 | 5358 |
| M | 697 | 4072 | 4769 |
------------------------------------------------------------------------------------------------------------------------
# plot Education_Level vs Attrition_Flag
stacked_barplot(data, "Education_Level", "Attrition_Flag")
| Attrition_Flag | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| Education_Level | |||
| All | 1371 | 7237 | 8608 |
| Graduate | 487 | 2641 | 3128 |
| High School | 306 | 1707 | 2013 |
| Uneducated | 237 | 1250 | 1487 |
| College | 154 | 859 | 1013 |
| Doctorate | 95 | 356 | 451 |
| Post-Graduate | 92 | 424 | 516 |
------------------------------------------------------------------------------------------------------------------------
# plot Marital_Status vs Attrition_Flag
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
| Attrition_Flag | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| Marital_Status | |||
| All | 1498 | 7880 | 9378 |
| Married | 709 | 3978 | 4687 |
| Single | 668 | 3275 | 3943 |
| Divorced | 121 | 627 | 748 |
------------------------------------------------------------------------------------------------------------------------
# plot Income_Category vs Attrition_Flag
stacked_barplot(data, "Income_Category", "Attrition_Flag")
| Attrition_Flag | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| Income_Category | |||
| All | 1627 | 8500 | 10127 |
| Less than $40K | 612 | 2949 | 3561 |
| $40K - $60K | 271 | 1519 | 1790 |
| $80K - $120K | 242 | 1293 | 1535 |
| $60K - $80K | 189 | 1213 | 1402 |
| abc | 187 | 925 | 1112 |
| $120K + | 126 | 601 | 727 |
------------------------------------------------------------------------------------------------------------------------
# plot Card_Category vs Attrition_Flag
stacked_barplot(data, "Card_Category", "Attrition_Flag")
| Attrition_Flag | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| Card_Category | |||
| All | 1627 | 8500 | 10127 |
| Blue | 1519 | 7917 | 9436 |
| Silver | 82 | 473 | 555 |
| Gold | 21 | 95 | 116 |
| Platinum | 5 | 15 | 20 |
------------------------------------------------------------------------------------------------------------------------
# heatmap for correlation
plt.figure(figsize=(15, 7))
sns.heatmap(
data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral", cbar=False
)
plt.show()
Let's plot the numerical variables against the Attrition_Flag and see if any variables have an influence on the attrition flag.
# for all numerical variables draw box plots
plt.figure(figsize=(20, 35))
for i, variable in enumerate(num_cols):
plt.subplot(7, 2, i + 1)
sns.boxplot(x=data["Attrition_Flag"], y=data[variable], palette="viridis")  # keyword args required in newer seaborn
plt.tight_layout()
plt.title(variable)
plt.show()
# pairplot to understand relationships (pairplot creates its own figure, so plt.figure is not needed)
sns.pairplot(data, hue="Attrition_Flag", corner=True)
plt.show()
# Mean of numerical variables grouped by attrition flag
data.groupby(["Attrition_Flag"])[num_cols].mean()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | ||||||||||||||
| Attrited Customer | 46.659 | 2.403 | 36.178 | 3.280 | 2.693 | 2.972 | 8136.039 | 672.823 | 7463.216 | 0.694 | 3095.026 | 44.934 | 0.554 | 0.162 |
| Existing Customer | 46.262 | 2.335 | 35.881 | 3.915 | 2.274 | 2.356 | 8726.878 | 1256.604 | 7470.273 | 0.773 | 4654.656 | 68.673 | 0.742 | 0.296 |
From the different plots above, we can draw inferences about how each feature relates to attrition.
Let's build a profile of the customers who have attrited, by first isolating them into their own dataframe. The key caveat is that these inferences are drawn from about 16% of the total dataset (1,627 of 10,127 rows), so they may vary with a larger sample.
# collecting attrited customers into a dataframe
attrited = data[data["Attrition_Flag"] == "Attrited Customer"].copy()  # copy to avoid SettingWithCopyWarning when adding bin columns later
attrited.shape
(1627, 25)
# plotting the different customer profile features against the Card_Category
plt.figure(figsize=(15, 20))
for i, variable in enumerate(cust_prof):
if variable != "Card_Category":
plt.subplot(4, 2, i + 1)
sns.countplot(
data=attrited, x="Card_Category", hue=variable,  # hue by column name, not the full-length Series
)
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# plotting the relationship parameters against the card category
plt.figure(figsize=(15, 20))
for i, variable in enumerate(relationship):
plt.subplot(4, 2, i + 1)
sns.countplot(
data=attrited, x="Card_Category", hue=variable,  # hue by column name, not the full-length Series
)
plt.tight_layout()
plt.title(variable)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
# plotting credit & balance related features against Card_Category
plt.figure(figsize=(15, 20))
for i, col in enumerate(credit):
# create bins using numpy
bins = np.linspace(attrited[col].min(), attrited[col].max(), 5)
# create a new column with bin labels
bin_labels = [f"{round(bins[x])}-{round(bins[x+1])}" for x in range(len(bins) - 1)]
attrited[f"{col}_bin"] = pd.cut(
attrited[col], bins=bins, labels=bin_labels, ordered=False
)
# plot a countplot
plt.subplot(4, 2, i + 1)
sns.countplot(x="Card_Category", hue=f"{col}_bin", data=attrited)
plt.tight_layout()
plt.title(col)
plt.xticks(rotation=45)
plt.legend(loc="upper right")
plt.show()
Customer Profile
Relationship Strategy
Credit & Balance
We'll detect outliers using the IQR method.
# calculating the Q1 and Q3 values outside of which outliers exist
# this code will print the columns with the number of outliers, and the upper and lower limits of the outliers
for col in num_cols:
quartiles = np.quantile(data[col], [0.25, 0.75])
Q1 = quartiles[0]
Q3 = quartiles[1]
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = np.logical_or(data[col] < lower, data[col] > upper)
print(
"Number of Outliers in", col, "are", outliers.sum(),
)
if outliers.sum() > 0:
if lower > 0:
print("Outlier values of", col, "are lower than", f"{lower}")
if upper > 0:
print("Outlier values of", col, "are greater than", f"{upper}")
print("\n")
Number of Outliers in Customer_Age are 2
Outlier values of Customer_Age are lower than 24.5
Outlier values of Customer_Age are greater than 68.5

Number of Outliers in Dependent_count are 0

Number of Outliers in Months_on_book are 386
Outlier values of Months_on_book are lower than 17.5
Outlier values of Months_on_book are greater than 53.5

Number of Outliers in Total_Relationship_Count are 0

Number of Outliers in Months_Inactive_12_mon are 331
Outlier values of Months_Inactive_12_mon are lower than 0.5
Outlier values of Months_Inactive_12_mon are greater than 4.5

Number of Outliers in Contacts_Count_12_mon are 629
Outlier values of Contacts_Count_12_mon are lower than 0.5
Outlier values of Contacts_Count_12_mon are greater than 4.5

Number of Outliers in Credit_Limit are 984
Outlier values of Credit_Limit are greater than 23836.25

Number of Outliers in Total_Revolving_Bal are 0

Number of Outliers in Avg_Open_To_Buy are 963
Outlier values of Avg_Open_To_Buy are greater than 22660.75

Number of Outliers in Total_Amt_Chng_Q4_Q1 are 396
Outlier values of Total_Amt_Chng_Q4_Q1 are lower than 0.289
Outlier values of Total_Amt_Chng_Q4_Q1 are greater than 1.201

Number of Outliers in Total_Trans_Amt are 896
Outlier values of Total_Trans_Amt are greater than 8619.25

Number of Outliers in Total_Trans_Ct are 2
Outlier values of Total_Trans_Ct are greater than 135.0

Number of Outliers in Total_Ct_Chng_Q4_Q1 are 394
Outlier values of Total_Ct_Chng_Q4_Q1 are lower than 0.228
Outlier values of Total_Ct_Chng_Q4_Q1 are greater than 1.172

Number of Outliers in Avg_Utilization_Ratio are 0
data[(data["Customer_Age"] < 24.5) | (data["Customer_Age"] > 68.5)]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | age_bin | bin_months | bin_bal | bin_ct | bin_amt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 251 | Existing Customer | 73 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 3 | 2 | 4469.000 | 1125 | 3344.000 | 1.363 | 1765 | 34 | 1.615 | 0.252 | 60-75 | 24-36 | 1000-1500 | 10-50 | 500-5000 |
| 254 | Existing Customer | 70 | M | 0 | High School | Married | Less than $40K | Blue | 56 | 3 | 2 | 3 | 3252.000 | 1495 | 1757.000 | 0.581 | 1227 | 15 | 0.875 | 0.460 | 60-75 | 48-60 | 1000-1500 | 10-50 | 500-5000 |
The outlier ages (70 and 73) are only just above the upper limit of 68.5.
data[(data["Total_Trans_Ct"] > 135)]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | age_bin | bin_months | bin_bal | bin_ct | bin_amt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9324 | Existing Customer | 41 | M | 3 | NaN | Married | $120K + | Blue | 33 | 2 | 4 | 3 | 34516.000 | 638 | 33878.000 | 0.724 | 13085 | 139 | 0.675 | 0.018 | 40-50 | 24-36 | 500-1000 | 100-150 | 10000-15000 |
| 9586 | Existing Customer | 56 | F | 1 | High School | Married | abc | Blue | 49 | 1 | 2 | 1 | 17542.000 | 2517 | 15025.000 | 0.800 | 13939 | 138 | 0.792 | 0.143 | 50-60 | 48-60 | NaN | 100-150 | 10000-15000 |
The outlier transaction counts (138 and 139) are only just above the upper limit of 135.0.
# Understanding the outliers in Credit_Limit
# check the relationship with "Income_Category", "Total_Revolving_Bal","Education_Level"
cols = ["Income_Category", "bin_bal", "Education_Level"]
# upper limit identified in outlier detection
lim = 23836.25
# isolating the data that exceeds the upper limit
data_lim = data[data["Credit_Limit"] > lim]
plt.figure(figsize=(12, 12))
# Loop through variables and plot countplots
for i, col in enumerate(cols):
plt.subplot(4, 2, i + 1)
sns.countplot(data=data_lim, x=col,)
plt.tight_layout()
# Show plot
plt.show()
# Understanding the outliers in Months_Inactive_12_mon
# check the relationship with "Income_Category", "Contacts_Count_12_mon", "Months_on_book"
cols = ["Income_Category", "Contacts_Count_12_mon", "bin_months"]
# upper limit identified in outlier detection
lim = 4.5
# isolating the data that exceeds the upper limit
data_lim = data[data["Months_Inactive_12_mon"] > lim]
plt.figure(figsize=(12, 10))
# Loop through variables and plot countplots
for i, col in enumerate(cols):
plt.subplot(4, 2, i + 1)
sns.countplot(
data=data_lim, x=col,
)
plt.tight_layout()
# Show plot
plt.show()
# Understanding the outliers in Avg_Open_To_Buy
# check the relationship with "Credit_Limit"
cols_num = ["Credit_Limit"]
# upper limit identified in outlier detection
lim = 22660.75
# isolating the data that exceeds the upper limit
data_lim = data[data["Avg_Open_To_Buy"] > lim]
plt.figure(figsize=(12, 10))
# Loop through variables and plot countplots
for i, col in enumerate(cols_num):
    plt.subplot(2, 2, i + 1)
    sns.histplot(data=data_lim, x=col)

plt.tight_layout()
plt.show()
# check the relationship with "Income_Category", "Total_Revolving_Bal"
cols_cat = ["Income_Category", "bin_bal"]

plt.figure(figsize=(12, 10))
for i, col in enumerate(cols_cat):
    plt.subplot(2, 2, i + 1)
    sns.countplot(data=data_lim, x=col)

plt.tight_layout()
# Show plot
plt.show()
Since the outliers were found to be genuine and informative, we will not drop them, as that would discard meaningful data. We will also not transform them, since similar values may well appear in production data.
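The upper limits used in the cells above (135.0, 23836.25, 4.5, 22660.75) come from the standard 1.5 × IQR whisker rule. A minimal sketch of that computation (the helper name `iqr_upper_limit` is ours, not part of the notebook):

```python
import pandas as pd

def iqr_upper_limit(series: pd.Series) -> float:
    """Upper whisker of the boxplot rule: Q3 + 1.5 * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# toy illustration: Q1 = 2.0, Q3 = 4.0, so the limit is 4.0 + 1.5 * 2.0
s = pd.Series([1, 2, 3, 4, 100])
print(iqr_upper_limit(s))  # → 7.0
```

Values above this limit were flagged as outliers in the earlier boxplot-based detection.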
We will build the models in the following manner: first on the original (imbalanced) training data, then on oversampled data, and finally on undersampled data.
The target variable Attrition_Flag will be encoded as follows: "Attrited Customer" = 1 and "Existing Customer" = 0.
When we build the model, the following outcomes can be wrongly predicted.
Predicting that a customer will stay, when the customer actually attrites (predicted negative, actually positive: a false negative). In this case the bank would not target the customer for relationship-building. A wrong prediction of this sort is very expensive: the bank loses the customer along with the associated fees and other sources of revenue.
Predicting that a customer will attrite, when the customer actually stays (predicted positive, actually negative: a false positive). Here the bank would spend resources on improving the relationship with a customer who was never going to leave. That spend could have been avoided, but the loss is far less severe than in the false-negative case.
The stated objective of the business is to predict which customers are likely to leave the product, so it is necessary to identify as many potential attriters as possible and reduce false negatives. Therefore, we must aim to maximize recall.
Recall is the metric of interest here.
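To make the choice concrete, recall is TP / (TP + FN): the share of actual attriters the model catches. A toy computation (illustrative labels only, not our data):

```python
# toy labels: 1 = attriter, 0 = existing customer
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # caught attriters
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed attriters
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms

recall = tp / (tp + fn)     # 3 of 4 attriters caught
precision = tp / (tp + fp)  # 3 of 4 flagged customers were real attriters
print(recall, precision)  # → 0.75 0.75
```

In the modeling cells below, sklearn's recall_score computes the same quantity.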
# make a copy of the dataframe
df = thera.copy()
# Dropping column - CLIENTNUM
df.drop(columns=["CLIENTNUM"], inplace=True)
Now that CLIENTNUM has been dropped, we can run another duplicate check to verify that there are indeed no duplicates in the dataframe.
df.duplicated().sum()
0
There are no duplicates in the dataframe.
We have seen from the EDA that the Income_Category column contains the placeholder value "abc". Let's replace it with NaN.
# count of values in Income_Category
df.Income_Category.value_counts(dropna=False)
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
There are 1112 rows with this invalid value.
# replace abc values with nan
df.Income_Category.replace(to_replace="abc", value=np.nan, inplace=True)
# verifying the counts again
df.Income_Category.value_counts(dropna=False)
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
NaN               1112
$120K +            727
Name: Income_Category, dtype: int64
All 1112 occurrences have been replaced with NaN. We will impute these values later.
# Separating target variable and other variables
X = df.drop(columns="Attrition_Flag")
# make dependent variable
y = df["Attrition_Flag"]
# check head
X.head()
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# check y value counts
y.value_counts()
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
# split into 80% temp and 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation in 75-25 ratio
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
To prevent any data leakage, we will perform the next operations on each dataset separately: transformers are fitted on the training set only and then applied to the validation and test sets.
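The pattern throughout this section is fit-on-train, transform-everywhere: any statistic (a mode, an encoder's category list) is learned from the training split alone and then reused on validation and test. A minimal pure-Python sketch of that pattern (toy data; the names `train_col` / `val_col` are illustrative):

```python
from collections import Counter

train_col = ["Married", "Single", "Married", None, "Married"]
val_col = [None, "Single"]

# "fit": learn the mode from the training split only
mode = Counter(v for v in train_col if v is not None).most_common(1)[0][0]

# "transform": fill missing values in any split with the train-derived mode
def fill(col):
    return [mode if v is None else v for v in col]

print(fill(val_col))  # → ['Married', 'Single']
```

Fitting on validation or test data instead would leak information about those sets into the model pipeline.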
# encode the target variable in train set
y_train.replace(
to_replace={"Attrited Customer": 1, "Existing Customer": 0}, inplace=True
)
# encode the target variable in validation set
y_val.replace(to_replace={"Attrited Customer": 1, "Existing Customer": 0}, inplace=True)
# encode the target variable in test set
y_test.replace(
to_replace={"Attrited Customer": 1, "Existing Customer": 0}, inplace=True
)
# printing the value counts in each set after encoding
print("y_train")
display(y_train.value_counts())
print("\n")
print("y_val")
display(y_val.value_counts())
print("\n")
print("y_test")
display(y_test.value_counts())
y_train
0    5099
1     976
Name: Attrition_Flag, dtype: int64

y_val
0    1700
1     326
Name: Attrition_Flag, dtype: int64

y_test
0    1701
1     325
Name: Attrition_Flag, dtype: int64
From our EDA, we found missing values in Education_Level, Marital_Status, and Income_Category. Let's impute these missing values with the most frequent value in each column.
# Let's impute the missing values with most frequent strategy
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
# list of columns which require imputation
cols_to_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
# Transform on validation and test data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
# transform the test data with the imputer fitted on train data
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
Let's verify whether any null values exist in the datasets after imputing
X_train.isnull().sum()
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
X_val.isnull().sum()
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
X_test.isnull().sum()
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
There are no null values in any columns.
We can encode the categorical variables using one-hot encoding. We will keep all levels after encoding (i.e., not drop the first column) so that every category is explicitly represented; the mostly tree-based models we use are not affected by the resulting multicollinearity.
# verify the dataframe before encoding
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 9501 to 703
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Customer_Age              6075 non-null   int64
 1   Gender                    6075 non-null   object
 2   Dependent_count           6075 non-null   int64
 3   Education_Level           6075 non-null   object
 4   Marital_Status            6075 non-null   object
 5   Income_Category           6075 non-null   object
 6   Card_Category             6075 non-null   object
 7   Months_on_book            6075 non-null   int64
 8   Total_Relationship_Count  6075 non-null   int64
 9   Months_Inactive_12_mon    6075 non-null   int64
 10  Contacts_Count_12_mon     6075 non-null   int64
 11  Credit_Limit              6075 non-null   float64
 12  Total_Revolving_Bal       6075 non-null   int64
 13  Avg_Open_To_Buy           6075 non-null   float64
 14  Total_Amt_Chng_Q4_Q1      6075 non-null   float64
 15  Total_Trans_Amt           6075 non-null   int64
 16  Total_Trans_Ct            6075 non-null   int64
 17  Total_Ct_Chng_Q4_Q1       6075 non-null   float64
 18  Avg_Utilization_Ratio     6075 non-null   float64
dtypes: float64(5), int64(9), object(5)
memory usage: 949.2+ KB
#list of columns to encode
cols_to_encode = [
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
Only a few columns need encoding, so we will extract those columns, encode them, and then concatenate the result with the remaining columns to recreate the dataframe.
# Create a OneHotEncoder object
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
# create an array after onehotencoding the columns
array_train_encoded = ohe.fit_transform(X_train[cols_to_encode])
# Convert the above array to a dataframe
data_train_encoded = pd.DataFrame(
array_train_encoded,
index=X_train.index,
columns=ohe.get_feature_names(cols_to_encode),
)
# Extract only the columns that didn't need to be encoded
data_other_cols = X_train.drop(columns=cols_to_encode)
# Concatenate the two dataframes and recreate the dataset
X_train = pd.concat([data_train_encoded, data_other_cols], axis=1)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 9501 to 703
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender_F                        6075 non-null   float64
 1   Gender_M                        6075 non-null   float64
 2   Education_Level_College         6075 non-null   float64
 3   Education_Level_Doctorate       6075 non-null   float64
 4   Education_Level_Graduate        6075 non-null   float64
 5   Education_Level_High School     6075 non-null   float64
 6   Education_Level_Post-Graduate   6075 non-null   float64
 7   Education_Level_Uneducated      6075 non-null   float64
 8   Marital_Status_Divorced         6075 non-null   float64
 9   Marital_Status_Married          6075 non-null   float64
 10  Marital_Status_Single           6075 non-null   float64
 11  Income_Category_$120K +         6075 non-null   float64
 12  Income_Category_$40K - $60K     6075 non-null   float64
 13  Income_Category_$60K - $80K     6075 non-null   float64
 14  Income_Category_$80K - $120K    6075 non-null   float64
 15  Income_Category_Less than $40K  6075 non-null   float64
 16  Card_Category_Blue              6075 non-null   float64
 17  Card_Category_Gold              6075 non-null   float64
 18  Card_Category_Platinum          6075 non-null   float64
 19  Card_Category_Silver            6075 non-null   float64
 20  Customer_Age                    6075 non-null   int64
 21  Dependent_count                 6075 non-null   int64
 22  Months_on_book                  6075 non-null   int64
 23  Total_Relationship_Count        6075 non-null   int64
 24  Months_Inactive_12_mon          6075 non-null   int64
 25  Contacts_Count_12_mon           6075 non-null   int64
 26  Credit_Limit                    6075 non-null   float64
 27  Total_Revolving_Bal             6075 non-null   int64
 28  Avg_Open_To_Buy                 6075 non-null   float64
 29  Total_Amt_Chng_Q4_Q1            6075 non-null   float64
 30  Total_Trans_Amt                 6075 non-null   int64
 31  Total_Trans_Ct                  6075 non-null   int64
 32  Total_Ct_Chng_Q4_Q1             6075 non-null   float64
 33  Avg_Utilization_Ratio           6075 non-null   float64
dtypes: float64(25), int64(9)
memory usage: 1.6 MB
We can see the new encoded columns in X_train, with data types appropriate for model building. Let's transform the remaining datasets.
# Transform the validation set
array_val_encoded = ohe.transform(X_val[cols_to_encode])
# Convert it to df
data_val_encoded = pd.DataFrame(
array_val_encoded, index=X_val.index, columns=ohe.get_feature_names(cols_to_encode),
)
# Extract only the columns that didn't need to be encoded
data_other_cols = X_val.drop(columns=cols_to_encode)
# Concatenate the two dataframes :
X_val = pd.concat([data_val_encoded, data_other_cols], axis=1)
# Transform the test set
array_test_encoded = ohe.transform(X_test[cols_to_encode])
# Convert it to df
data_test_encoded = pd.DataFrame(
array_test_encoded,
index=X_test.index,
columns=ohe.get_feature_names(cols_to_encode),
)
# Extract only the columns that didn't need to be encoded
data_other_cols = X_test.drop(columns=cols_to_encode)
# Concatenate the two dataframes :
X_test = pd.concat([data_test_encoded, data_other_cols], axis=1)
# verify the shape of each dataset after creating dummy variables
print("Shape of X_train: ", X_train.shape)
print("Shape of X_val: ",X_val.shape)
print("Shape of X_test: ",X_test.shape)
Shape of X_train:  (6075, 34)
Shape of X_val:  (2026, 34)
Shape of X_test:  (2026, 34)
Before we head into model building, let's define a few helper functions that will compute performance metrics and plot confusion matrices.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Empty list to store all the models
models = []
# Appending models into the list
models.append(("Bagging_def", BaggingClassifier(random_state=1)))
models.append(("RandomForest_def", RandomForestClassifier(random_state=1)))
models.append(("GBM_def", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost_def", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost_def", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree_def", DecisionTreeClassifier(random_state=1)))
models.append(("logreg_def", LogisticRegression(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
# empty dataframes to store scores of all default models
score_train_df = pd.DataFrame()
score_val_df = pd.DataFrame()

# calculate metrics for train and validation sets for each model
# concatenate model performance in the respective dataframe
for name, model in models:
    model.fit(X_train, y_train)
    score_train = model_performance_classification_sklearn(model, X_train, y_train)
    score_val = model_performance_classification_sklearn(model, X_val, y_val)
    score_val = score_val.set_index(pd.Index([name]))
    score_val_df = pd.concat([score_val_df, score_val])
    score_train = score_train.set_index(pd.Index([name]))
    score_train_df = pd.concat([score_train_df, score_train])

# rename index of the final dataframe
score_train_df.index.name = "Model"
score_val_df.index.name = "Model"

# print the dataframes
print("\n")
print("Training Performance\n")
display(score_train_df)
print("\n")
print("Validation Performance\n")
display(score_val_df)
Cross-Validation Performance:

Bagging_def: 79.10361067503925
RandomForest_def: 74.59236002093145
GBM_def: 83.40711669283097
Adaboost_def: 83.09471480900052
Xgboost_def: 87.29722658294087
dtree_def: 78.38566195709053
logreg_def: 47.43380429094716

Training Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Model | ||||
| Bagging_def | 0.996 | 0.978 | 0.997 | 0.988 |
| RandomForest_def | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_def | 0.978 | 0.905 | 0.957 | 0.930 |
| Adaboost_def | 0.961 | 0.850 | 0.903 | 0.876 |
| Xgboost_def | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_def | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_def | 0.881 | 0.402 | 0.738 | 0.520 |
Validation Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Model | ||||
| Bagging_def | 0.952 | 0.776 | 0.913 | 0.839 |
| RandomForest_def | 0.940 | 0.687 | 0.918 | 0.786 |
| GBM_def | 0.961 | 0.816 | 0.933 | 0.871 |
| Adaboost_def | 0.955 | 0.813 | 0.898 | 0.853 |
| Xgboost_def | 0.969 | 0.865 | 0.937 | 0.900 |
| dtree_def | 0.932 | 0.755 | 0.812 | 0.782 |
| logreg_def | 0.869 | 0.325 | 0.702 | 0.444 |
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Before we proceed further, let's once again check the value counts of the target variable in each set
# Checking class balance for whole data, train set, validation set, and test set
print("Target value ratio in y")
print(y.value_counts())
print("*" * 80)
print("Target value ratio in y_train")
print(y_train.value_counts())
print("*" * 80)
print("Target value ratio in y_val")
print(y_val.value_counts())
print("*" * 80)
print("Target value ratio in y_test")
print(y_test.value_counts())
print("*" * 80)
Target value ratio in y
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
********************************************************************************
Target value ratio in y_train
0    5099
1     976
Name: Attrition_Flag, dtype: int64
********************************************************************************
Target value ratio in y_val
0    1700
1     326
Name: Attrition_Flag, dtype: int64
********************************************************************************
Target value ratio in y_test
0    1701
1     325
Name: Attrition_Flag, dtype: int64
********************************************************************************
We have maintained the class proportions in each dataset as it was in the original dataset.
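The `stratify=` arguments in the splits above are what keep these ratios aligned. A quick way to compare splits is `value_counts(normalize=True)`, sketched here on toy labels that mimic this dataset's roughly 84/16 split:

```python
import pandas as pd

# toy labels: 84 existing customers (0) and 16 attriters (1)
y_toy = pd.Series([0] * 84 + [1] * 16)

# normalize=True converts counts to proportions, making splits directly comparable
print(y_toy.value_counts(normalize=True).to_dict())  # → {0: 0.84, 1: 0.16}
```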
Let's use the SMOTE technique to create an oversampled dataset. We will oversample the minority class (1) until it reaches 40% of the size of the majority class (i.e., sampling_strategy=0.4).
Oversampling is applied only to the training set, never to the validation or test sets.
# Fit SMOTE on train data(Synthetic Minority Oversampling Technique)
sm = SMOTE(sampling_strategy=0.4, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# print the counts
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 976
Before OverSampling, count of label '0': 5099

After OverSampling, count of label '1': 2039
After OverSampling, count of label '0': 5099

After OverSampling, the shape of train_X: (7138, 34)
After OverSampling, the shape of train_y: (7138,)
We can see that after oversampling, the minority class in the new training set has grown to 40% of the size of the majority class (2039 ≈ 0.4 × 5099).
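In other words, sampling_strategy sets the minority-to-majority ratio after resampling, which the counts above let us verify directly:

```python
# counts of label 1 (minority) and label 0 (majority) after SMOTE, from the output above
minority, majority = 2039, 5099
print(round(minority / majority, 3))  # → 0.4
```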
Once we build the models on oversampled data, we will concatenate their scores into the dataframe that already holds the default models' scores; this consolidated view makes comparison easier.
models_over = []
# Appending models_over into the list
models_over.append(("Bagging_over", BaggingClassifier(random_state=1)))
models_over.append(("RandomForest_over", RandomForestClassifier(random_state=1)))
models_over.append(("GBM_over", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost_over", AdaBoostClassifier(random_state=1)))
models_over.append(("Xgboost_over", XGBClassifier(random_state=1, eval_metric="logloss")))
models_over.append(("dtree_over", DecisionTreeClassifier(random_state=1)))
models_over.append(("logreg_over", LogisticRegression(random_state=1)))
results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models_over:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
# calculate metrics for train and validation sets for each model
# concatenate model performance in the respective dataframe
for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    score_train = model_performance_classification_sklearn(
        model, X_train_over, y_train_over
    )
    score_val = model_performance_classification_sklearn(model, X_val, y_val)
    score_val = score_val.set_index(pd.Index([name]))
    score_val_df = pd.concat([score_val_df, score_val])
    score_train = score_train.set_index(pd.Index([name]))
    score_train_df = pd.concat([score_train_df, score_train])

# print the dataframes
print("\n")
print("Training Performance\n")
display(score_train_df)
print("\n")
print("Validation Performance\n")
display(score_val_df)
Cross-Validation Performance:

Bagging_over: 91.36893578070048
RandomForest_over: 90.43635881871177
GBM_over: 92.79062484944838
Adaboost_over: 90.48634195693019
Xgboost_over: 95.09526906585731
dtree_over: 89.45536445536446
logreg_over: 61.01062292238764

Training Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Bagging_def | 0.996 | 0.978 | 0.997 | 0.988 |
| RandomForest_def | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_def | 0.978 | 0.905 | 0.957 | 0.930 |
| Adaboost_def | 0.961 | 0.850 | 0.903 | 0.876 |
| Xgboost_def | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_def | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_def | 0.881 | 0.402 | 0.738 | 0.520 |
| Bagging_over | 0.997 | 0.992 | 0.998 | 0.995 |
| RandomForest_over | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_over | 0.977 | 0.951 | 0.966 | 0.959 |
| Adaboost_over | 0.958 | 0.918 | 0.934 | 0.926 |
| Xgboost_over | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_over | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_over | 0.817 | 0.582 | 0.722 | 0.645 |
Validation Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Bagging_def | 0.952 | 0.776 | 0.913 | 0.839 |
| RandomForest_def | 0.940 | 0.687 | 0.918 | 0.786 |
| GBM_def | 0.961 | 0.816 | 0.933 | 0.871 |
| Adaboost_def | 0.955 | 0.813 | 0.898 | 0.853 |
| Xgboost_def | 0.969 | 0.865 | 0.937 | 0.900 |
| dtree_def | 0.932 | 0.755 | 0.812 | 0.782 |
| logreg_def | 0.869 | 0.325 | 0.702 | 0.444 |
| Bagging_over | 0.949 | 0.791 | 0.881 | 0.834 |
| RandomForest_over | 0.943 | 0.733 | 0.892 | 0.805 |
| GBM_over | 0.963 | 0.859 | 0.909 | 0.883 |
| Adaboost_over | 0.954 | 0.862 | 0.854 | 0.858 |
| Xgboost_over | 0.972 | 0.890 | 0.935 | 0.912 |
| dtree_over | 0.923 | 0.742 | 0.773 | 0.757 |
| logreg_over | 0.855 | 0.537 | 0.552 | 0.544 |
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We will use RandomUnderSampler with sampling_strategy=1 for this section of model building. A sampling_strategy of 1 means the majority class is undersampled until both classes have equal representation.
# fit random under sampler on the train data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# print the counts
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, count of label '1': 976
Before Under Sampling, count of label '0': 5099

After Under Sampling, count of label '1': 976
After Under Sampling, count of label '0': 976

After Under Sampling, the shape of train_X: (1952, 34)
After Under Sampling, the shape of train_y: (1952,)
# creating an empty list to store the models
models_under = []
# Appending models_under into the list
models_under.append(("Bagging_under", BaggingClassifier(random_state=1)))
models_under.append(("RandomForest_under", RandomForestClassifier(random_state=1)))
models_under.append(("GBM_under", GradientBoostingClassifier(random_state=1)))
models_under.append(("Adaboost_under", AdaBoostClassifier(random_state=1)))
models_under.append(("Xgboost_under", XGBClassifier(random_state=1, eval_metric="logloss")))
models_under.append(("dtree_under", DecisionTreeClassifier(random_state=1)))
models_under.append(("logreg_under", LogisticRegression(random_state=1)))
results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models_under:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
# calculate metrics for train and validation sets for each model
# concatenate model performance in the respective dataframe
for name, model in models_under:
    model.fit(X_train_un, y_train_un)
    score_train = model_performance_classification_sklearn(model, X_train_un, y_train_un)
    score_val = model_performance_classification_sklearn(model, X_val, y_val)
    score_val = score_val.set_index(pd.Index([name]))
    score_val_df = pd.concat([score_val_df, score_val])
    score_train = score_train.set_index(pd.Index([name]))
    score_train_df = pd.concat([score_train_df, score_train])

# print the dataframes
print("\n")
print("Training Performance\n")
display(score_train_df)
print("\n")
print("Validation Performance\n")
display(score_val_df)
Cross-Validation Performance:

Bagging_under: 91.28676085818942
RandomForest_under: 91.80272108843536
GBM_under: 94.56567242281528
Adaboost_under: 92.92673992673993
Xgboost_under: 94.56671899529043
dtree_under: 90.0596546310832
logreg_under: 80.52956567242282

Training Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Bagging_def | 0.996 | 0.978 | 0.997 | 0.988 |
| RandomForest_def | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_def | 0.978 | 0.905 | 0.957 | 0.930 |
| Adaboost_def | 0.961 | 0.850 | 0.903 | 0.876 |
| Xgboost_def | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_def | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_def | 0.881 | 0.402 | 0.738 | 0.520 |
| Bagging_over | 0.997 | 0.992 | 0.998 | 0.995 |
| RandomForest_over | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_over | 0.977 | 0.951 | 0.966 | 0.959 |
| Adaboost_over | 0.958 | 0.918 | 0.934 | 0.926 |
| Xgboost_over | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_over | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_over | 0.817 | 0.582 | 0.722 | 0.645 |
| Bagging_under | 0.995 | 0.994 | 0.997 | 0.995 |
| RandomForest_under | 1.000 | 1.000 | 1.000 | 1.000 |
| GBM_under | 0.974 | 0.975 | 0.972 | 0.974 |
| Adaboost_under | 0.946 | 0.952 | 0.941 | 0.947 |
| Xgboost_under | 1.000 | 1.000 | 1.000 | 1.000 |
| dtree_under | 1.000 | 1.000 | 1.000 | 1.000 |
| logreg_under | 0.790 | 0.790 | 0.791 | 0.790 |
Validation Performance
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Bagging_def | 0.952 | 0.776 | 0.913 | 0.839 |
| RandomForest_def | 0.940 | 0.687 | 0.918 | 0.786 |
| GBM_def | 0.961 | 0.816 | 0.933 | 0.871 |
| Adaboost_def | 0.955 | 0.813 | 0.898 | 0.853 |
| Xgboost_def | 0.969 | 0.865 | 0.937 | 0.900 |
| dtree_def | 0.932 | 0.755 | 0.812 | 0.782 |
| logreg_def | 0.869 | 0.325 | 0.702 | 0.444 |
| Bagging_over | 0.949 | 0.791 | 0.881 | 0.834 |
| RandomForest_over | 0.943 | 0.733 | 0.892 | 0.805 |
| GBM_over | 0.963 | 0.859 | 0.909 | 0.883 |
| Adaboost_over | 0.954 | 0.862 | 0.854 | 0.858 |
| Xgboost_over | 0.972 | 0.890 | 0.935 | 0.912 |
| dtree_over | 0.923 | 0.742 | 0.773 | 0.757 |
| logreg_over | 0.855 | 0.537 | 0.552 | 0.544 |
| Bagging_under | 0.931 | 0.871 | 0.743 | 0.802 |
| RandomForest_under | 0.939 | 0.899 | 0.765 | 0.827 |
| GBM_under | 0.947 | 0.926 | 0.784 | 0.850 |
| Adaboost_under | 0.923 | 0.929 | 0.697 | 0.796 |
| Xgboost_under | 0.956 | 0.933 | 0.819 | 0.872 |
| dtree_under | 0.909 | 0.890 | 0.662 | 0.759 |
| logreg_under | 0.805 | 0.773 | 0.440 | 0.561 |
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Before we go further, let's evaluate the three sets of models we have built. We will define a threshold for our metrics and select only the models that exceed it.
# rename columns of the train dataset
score_train_df = score_train_df.rename(
columns={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
}
)
# rename the columns of the val dataset
score_val_df = score_val_df.rename(
columns={
"Accuracy": "Accuracy_val",
"Recall": "Recall_val",
"Precision": "Precision_val",
"F1": "F1_val",
}
)
# concatenate train and val scores for side-by-side comparison
final_model_comp = pd.concat([score_train_df, score_val_df], axis=1)
# rearrange the columns so that similar scores are next to each other
final_model_comp = final_model_comp[
[
"Accuracy_train",
"Accuracy_val",
"Recall_train",
"Recall_val",
"Precision_train",
"Precision_val",
"F1_train",
"F1_val",
]
]
# create a new column to store the difference in recall between train and val datasets
final_model_comp["RecallDiff"] = (
final_model_comp["Recall_train"] - final_model_comp["Recall_val"]
)
# create a new column to store the difference in precision between train and val datasets
final_model_comp["PrcDiff"] = (
final_model_comp["Precision_train"] - final_model_comp["Precision_val"]
)
# present the full dataframe
final_model_comp
| Accuracy_train | Accuracy_val | Recall_train | Recall_val | Precision_train | Precision_val | F1_train | F1_val | RecallDiff | PrcDiff | |
|---|---|---|---|---|---|---|---|---|---|---|
| Bagging_def | 0.996 | 0.952 | 0.978 | 0.776 | 0.997 | 0.913 | 0.988 | 0.839 | 0.202 | 0.084 |
| RandomForest_def | 1.000 | 0.940 | 1.000 | 0.687 | 1.000 | 0.918 | 1.000 | 0.786 | 0.313 | 0.082 |
| GBM_def | 0.978 | 0.961 | 0.905 | 0.816 | 0.957 | 0.933 | 0.930 | 0.871 | 0.089 | 0.023 |
| Adaboost_def | 0.961 | 0.955 | 0.850 | 0.813 | 0.903 | 0.898 | 0.876 | 0.853 | 0.038 | 0.005 |
| Xgboost_def | 1.000 | 0.969 | 1.000 | 0.865 | 1.000 | 0.937 | 1.000 | 0.900 | 0.135 | 0.063 |
| dtree_def | 1.000 | 0.932 | 1.000 | 0.755 | 1.000 | 0.812 | 1.000 | 0.782 | 0.245 | 0.188 |
| logreg_def | 0.881 | 0.869 | 0.402 | 0.325 | 0.738 | 0.702 | 0.520 | 0.444 | 0.076 | 0.036 |
| Bagging_over | 0.997 | 0.949 | 0.992 | 0.791 | 0.998 | 0.881 | 0.995 | 0.834 | 0.201 | 0.117 |
| RandomForest_over | 1.000 | 0.943 | 1.000 | 0.733 | 1.000 | 0.892 | 1.000 | 0.805 | 0.267 | 0.108 |
| GBM_over | 0.977 | 0.963 | 0.951 | 0.859 | 0.966 | 0.909 | 0.959 | 0.883 | 0.093 | 0.057 |
| Adaboost_over | 0.958 | 0.954 | 0.918 | 0.862 | 0.934 | 0.854 | 0.926 | 0.858 | 0.056 | 0.080 |
| Xgboost_over | 1.000 | 0.972 | 1.000 | 0.890 | 1.000 | 0.935 | 1.000 | 0.912 | 0.110 | 0.065 |
| dtree_over | 1.000 | 0.923 | 1.000 | 0.742 | 1.000 | 0.773 | 1.000 | 0.757 | 0.258 | 0.227 |
| logreg_over | 0.817 | 0.855 | 0.582 | 0.537 | 0.722 | 0.552 | 0.645 | 0.544 | 0.045 | 0.170 |
| Bagging_under | 0.995 | 0.931 | 0.994 | 0.871 | 0.997 | 0.743 | 0.995 | 0.802 | 0.123 | 0.253 |
| RandomForest_under | 1.000 | 0.939 | 1.000 | 0.899 | 1.000 | 0.765 | 1.000 | 0.827 | 0.101 | 0.235 |
| GBM_under | 0.974 | 0.947 | 0.975 | 0.926 | 0.972 | 0.784 | 0.974 | 0.850 | 0.049 | 0.188 |
| Adaboost_under | 0.946 | 0.923 | 0.952 | 0.929 | 0.941 | 0.697 | 0.947 | 0.796 | 0.022 | 0.245 |
| Xgboost_under | 1.000 | 0.956 | 1.000 | 0.933 | 1.000 | 0.819 | 1.000 | 0.872 | 0.067 | 0.181 |
| dtree_under | 1.000 | 0.909 | 1.000 | 0.890 | 1.000 | 0.662 | 1.000 | 0.759 | 0.110 | 0.338 |
| logreg_under | 0.790 | 0.805 | 0.790 | 0.773 | 0.791 | 0.440 | 0.790 | 0.561 | 0.017 | 0.351 |
# extract rows where recall on both the train and val sets is greater than 89%,
# the recall difference between train and val is positive and less than 10%,
# and the validation F1 score is greater than 75%;
# then sort by the smallest difference between training and validation recall
final_model_comp[
(final_model_comp["Recall_train"] > 0.89)
& (final_model_comp["Recall_val"] > 0.89)
& (final_model_comp["RecallDiff"] > 0)
& (final_model_comp["RecallDiff"] < 0.1)
& (final_model_comp["F1_val"] > 0.75)
].sort_values("RecallDiff", ascending=True)
| Accuracy_train | Accuracy_val | Recall_train | Recall_val | Precision_train | Precision_val | F1_train | F1_val | RecallDiff | PrcDiff | |
|---|---|---|---|---|---|---|---|---|---|---|
| Adaboost_under | 0.946 | 0.923 | 0.952 | 0.929 | 0.941 | 0.697 | 0.947 | 0.796 | 0.022 | 0.245 |
| GBM_under | 0.974 | 0.947 | 0.975 | 0.926 | 0.972 | 0.784 | 0.974 | 0.850 | 0.049 | 0.188 |
| Xgboost_under | 1.000 | 0.956 | 1.000 | 0.933 | 1.000 | 0.819 | 1.000 | 0.872 | 0.067 | 0.181 |
Let's begin tuning our top models using RandomizedSearchCV.
# create empty DataFrames to store the scores of all tuned models
tuned_score_train = pd.DataFrame()
tuned_score_val = pd.DataFrame()
# rename index of the final dataframe
tuned_score_train.index.name = "Model"
tuned_score_val.index.name = "Model"
# define the model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
parameters = {
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1,),
DecisionTreeClassifier(max_depth=2, random_state=1,),
DecisionTreeClassifier(max_depth=3, random_state=1,),
],
"n_estimators": np.arange(10, 110, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=parameters,
n_iter=20,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'n_estimators': 40, 'learning_rate': 0.5, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9477498691784406:
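As a side note, RandomizedSearchCV does not evaluate the full cross-product of the grid; it draws `n_iter` random parameter combinations. `ParameterSampler`, the utility it uses internally, makes this visible on a small illustrative grid (a sketch, not the full grid used above):

```python
from sklearn.model_selection import ParameterSampler
import numpy as np

# a small illustrative grid (a subset of the parameters tuned above)
grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": np.arange(0.1, 2, 0.1),
}

# draw 3 random combinations, just as RandomizedSearchCV would with n_iter=3
draws = list(ParameterSampler(grid, n_iter=3, random_state=1))
for d in draws:
    print(d)
```

With `n_iter=20` against grids this size, the search trades exhaustiveness for speed, which is why a follow-up GridSearchCV around the best draw can still be worthwhile.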
# fitting the model with best parameters from random search
abc_un_tuned = AdaBoostClassifier(
random_state=1,
n_estimators=40,
learning_rate=0.5,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
abc_un_tuned.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.5, n_estimators=40, random_state=1)
# performance metrics for undersampled training dataset
abc_un_tuned_train_model = model_performance_classification_sklearn(
abc_un_tuned, X_train_un, y_train_un
).rename(
index={0: "ABCun_Tuned"},
columns={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
},
)
abc_un_tuned_train_model
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| ABCun_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
# concatenating the score to the consolidated score dataframe
tuned_score_train = pd.concat([tuned_score_train, abc_un_tuned_train_model])
tuned_score_train
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| ABCun_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
# performance metrics for validation dataset
abc_un_tuned_val_model = model_performance_classification_sklearn(
abc_un_tuned, X_val, y_val
).rename(
index={0: "ABCun_Tuned"},
columns={
"Accuracy": "Accuracy_val",
"Recall": "Recall_val",
"Precision": "Precision_val",
"F1": "F1_val",
},
)
abc_un_tuned_val_model
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| ABCun_Tuned | 0.937 | 0.945 | 0.737 | 0.828 |
# concatenating the score to the consolidated score dataframe
tuned_score_val = pd.concat([tuned_score_val, abc_un_tuned_val_model])
tuned_score_val
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| ABCun_Tuned | 0.937 | 0.945 | 0.737 | 0.828 |
# define a gradient boosting model initialized with an AdaBoost classifier
model = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1), random_state=1
)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100, 150, 200, 250],
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"min_samples_split": [2, 4, 6, 8],
"min_samples_leaf": [1, 2, 3, 4],
"max_depth": [3, 4, 5, 6],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=parameters,
n_iter=20,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'subsample': 1, 'n_estimators': 150, 'min_samples_split': 2, 'min_samples_leaf': 3, 'max_features': 0.8, 'max_depth': 4} with CV score=0.9579905808477237:
# fitting the data with best parameters from random search
gbm_under_tuned = GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1),
random_state=1,
subsample=1,
n_estimators=150,
max_features=0.8,
min_samples_split=2,
min_samples_leaf=3,
max_depth=4,
)
gbm_under_tuned.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1), max_depth=4,
max_features=0.8, min_samples_leaf=3,
n_estimators=150, random_state=1, subsample=1)
# performance metrics of training dataset
gbm_under_tuned_train_model = model_performance_classification_sklearn(
gbm_under_tuned, X_train_un, y_train_un
).rename(
index={0: "GBMunder_Tuned"},
columns={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
},
)
gbm_under_tuned_train_model
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| GBMunder_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
# concatenating the scores to the final training dataframe
tuned_score_train = pd.concat([tuned_score_train, gbm_under_tuned_train_model])
tuned_score_train
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| ABCun_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
| GBMunder_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
# performance metrics on validation dataset
gbm_under_tuned_val_model = model_performance_classification_sklearn(
gbm_under_tuned, X_val, y_val
).rename(
index={0: "GBMunder_Tuned"},
columns={
"Accuracy": "Accuracy_val",
"Recall": "Recall_val",
"Precision": "Precision_val",
"F1": "F1_val",
},
)
gbm_under_tuned_val_model
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| GBMunder_Tuned | 0.960 | 0.939 | 0.834 | 0.883 |
# concatenating to the final validation score dataframe
tuned_score_val = pd.concat([tuned_score_val, gbm_under_tuned_val_model])
tuned_score_val
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| ABCun_Tuned | 0.937 | 0.945 | 0.737 | 0.828 |
| GBMunder_Tuned | 0.960 | 0.939 | 0.834 | 0.883 |
# define the XGBoost model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass in RandomizedSearchCV
parameters = {
"n_estimators": np.arange(10, 100, 20),
"scale_pos_weight": [0, 1, 2, 5],
"subsample": [0.5, 0.7, 0.9, 1],
"learning_rate": [0.01, 0.1, 0.2, 0.05],
"gamma": [0, 1, 3],
"colsample_bytree": [0.5, 0.7, 0.9, 1],
"colsample_bylevel": [0.5, 0.7, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=parameters,
n_iter=20,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 5, 'n_estimators': 10, 'learning_rate': 0.05, 'gamma': 1, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.7} with CV score=0.9856514913657772:
xgb_under_tuned = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=5,
n_estimators=10,
learning_rate=0.05,
gamma=1,
colsample_bytree=0.7,
colsample_bylevel=0.7,
)
xgb_under_tuned.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=0.7, colsample_bynode=None,
colsample_bytree=0.7, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=1, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=10, n_jobs=None,
num_parallel_tree=None, predictor=None, random_state=1, ...)
xgb_under_tuned_train_model = model_performance_classification_sklearn(
xgb_under_tuned, X_train_un, y_train_un
).rename(
index={0: "XGBUnder_Tuned"},
columns={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
},
)
xgb_under_tuned_train_model
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| XGBUnder_Tuned | 0.893 | 1.000 | 0.824 | 0.904 |
tuned_score_train = pd.concat([tuned_score_train, xgb_under_tuned_train_model])
tuned_score_train
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| ABCun_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
| GBMunder_Tuned | 1.000 | 1.000 | 1.000 | 1.000 |
| XGBUnder_Tuned | 0.893 | 1.000 | 0.824 | 0.904 |
xgb_under_tuned_val_model = model_performance_classification_sklearn(
xgb_under_tuned, X_val, y_val
).rename(
index={0: "XGBUnder_Tuned"},
columns={
"Accuracy": "Accuracy_val",
"Recall": "Recall_val",
"Precision": "Precision_val",
"F1": "F1_val",
},
)
xgb_under_tuned_val_model
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| XGBUnder_Tuned | 0.802 | 0.994 | 0.448 | 0.618 |
tuned_score_val = pd.concat([tuned_score_val, xgb_under_tuned_val_model])
tuned_score_val
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| ABCun_Tuned | 0.937 | 0.945 | 0.737 | 0.828 |
| GBMunder_Tuned | 0.960 | 0.939 | 0.834 | 0.883 |
| XGBUnder_Tuned | 0.802 | 0.994 | 0.448 | 0.618 |
# concatenating train and validation scores
final_tuned_model_comp = pd.concat([tuned_score_train, tuned_score_val], axis=1)
# re-arranging the columns to view similar scores side by side
final_tuned_model_comp = final_tuned_model_comp[
[
"Accuracy_train",
"Accuracy_val",
"Recall_train",
"Recall_val",
"Precision_train",
"Precision_val",
"F1_train",
"F1_val",
]
]
# create new column to capture difference between train and validation recall scores
final_tuned_model_comp["RecallDiff"] = (
final_tuned_model_comp["Recall_train"] - final_tuned_model_comp["Recall_val"]
)
# create new column to capture difference between train and validation precision scores
final_tuned_model_comp["PrcDiff"] = (
final_tuned_model_comp["Precision_train"] - final_tuned_model_comp["Precision_val"]
)
final_tuned_model_comp = final_tuned_model_comp.sort_values(
["RecallDiff", "PrcDiff"], ascending=True
)
final_tuned_model_comp
| Accuracy_train | Accuracy_val | Recall_train | Recall_val | Precision_train | Precision_val | F1_train | F1_val | RecallDiff | PrcDiff | |
|---|---|---|---|---|---|---|---|---|---|---|
| XGBUnder_Tuned | 0.893 | 0.802 | 1.000 | 0.994 | 0.824 | 0.448 | 0.904 | 0.618 | 0.006 | 0.376 |
| ABCun_Tuned | 1.000 | 0.937 | 1.000 | 0.945 | 1.000 | 0.737 | 1.000 | 0.828 | 0.055 | 0.263 |
| GBMunder_Tuned | 1.000 | 0.960 | 1.000 | 0.939 | 1.000 | 0.834 | 1.000 | 0.883 | 0.061 | 0.166 |
We can consider the undersampled AdaBoost model as our final model.
# performance metrics for the test dataset
abc_un_tuned_test_model = model_performance_classification_sklearn(
abc_un_tuned, X_test, y_test
).rename(
index={0: "ABCun_Tuned"},
columns={
"Accuracy": "Accuracy_test",
"Recall": "Recall_test",
"Precision": "Precision_test",
"F1": "F1_test",
},
)
abc_un_tuned_test_model
| Accuracy_test | Recall_test | Precision_test | F1_test | |
|---|---|---|---|---|
| ABCun_Tuned | 0.942 | 0.957 | 0.749 | 0.841 |
The test scores are also very promising. Recall on the test dataset is 96%.
As a final attempt to improve the model, we can perform a GridSearchCV around the best parameters obtained from the RandomizedSearchCV.
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
# Best parameters are {'n_estimators': 40, 'learning_rate': 0.5, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
param_grid = {
"n_estimators": np.arange(35, 45, 1),
"learning_rate": np.arange(0.45,0.6,0.1),
"base_estimator": [
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train_un, y_train_un)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'learning_rate': 0.55, 'n_estimators': 36}
Score: 0.9415960230245946
CPU times: total: 641 ms
Wall time: 14 s
# fitting the gridsearch tuned model
abc_gtuned = AdaBoostClassifier(
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
learning_rate=0.55,
n_estimators=36,
)
abc_gtuned.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.55, n_estimators=36, random_state=1)
# performance metrics on train set
abc_gtuned_train_model = model_performance_classification_sklearn(
abc_gtuned, X_train, y_train
).rename(
index={0: "ABC_Grid_Tuned"},
columns={
"Accuracy": "Accuracy_train",
"Recall": "Recall_train",
"Precision": "Precision_train",
"F1": "F1_train",
},
)
abc_gtuned_train_model
| Accuracy_train | Recall_train | Precision_train | F1_train | |
|---|---|---|---|---|
| ABC_Grid_Tuned | 0.951 | 0.999 | 0.766 | 0.867 |
# performance metrics on val set
abc_gtuned_val_model = model_performance_classification_sklearn(
abc_gtuned, X_val, y_val
).rename(
index={0: "ABC_Grid_Tuned"},
columns={
"Accuracy": "Accuracy_val",
"Recall": "Recall_val",
"Precision": "Precision_val",
"F1": "F1_val",
},
)
abc_gtuned_val_model
| Accuracy_val | Recall_val | Precision_val | F1_val | |
|---|---|---|---|---|
| ABC_Grid_Tuned | 0.933 | 0.939 | 0.725 | 0.818 |
# performance metrics of test set
abc_gtuned_test_model = model_performance_classification_sklearn(
abc_gtuned, X_test, y_test
).rename(
index={0: "ABC_Grid_Tuned"},
columns={
"Accuracy": "Accuracy_test",
"Recall": "Recall_test",
"Precision": "Precision_test",
"F1": "F1_test",
},
)
abc_gtuned_test_model
| Accuracy_test | Recall_test | Precision_test | F1_test | |
|---|---|---|---|---|
| ABC_Grid_Tuned | 0.941 | 0.932 | 0.756 | 0.835 |
Since the GridSearchCV did not improve performance, we can consider the undersampled AdaBoost model with parameters from the RandomizedSearchCV as our final model. Test recall of the final model is about 96%, without compromising much on the F1 score. Let's see what the important features are and plot the confusion matrix as well.
Let's try to understand which features are most important to the model's predictions.
feature_names = X_train.columns
importances = abc_un_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The most important features according to the model are:
This ties well with our observations from the EDA stages.
# plotting confusion matrix for test dataset
confusion_matrix_sklearn(abc_un_tuned, X_test, y_test)
True Positives (TP): the model correctly predicted 311 customers will attrite from the bank.
True Negatives (TN): the model correctly predicted that 1597 customers will stay with the bank.
False Positives (FP): the model incorrectly predicted that 104 customers will attrite from the bank (a "Type I error").
False Negatives (FN): the model incorrectly predicted that 14 customers will stay with the bank (a "Type II error").
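The reported test metrics can be verified directly from these four counts — a quick sanity check using the numbers read off the confusion matrix above:

```python
# confusion-matrix counts from the test-set plot above
TP, TN, FP, FN = 311, 1597, 104, 14

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)       # share of actual attriters the model catches
precision = TP / (TP + FP)    # share of predicted attriters who actually attrite
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))
# -> 0.942 0.957 0.749 0.841
```

These match the ABCun_Tuned test scores reported earlier, confirming the confusion matrix and the metric table describe the same predictions.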
To build the production pipeline, we will perform the following steps.
# create a copy of the dataset
data = thera.copy()
# Separating target variable and other variables
X = data.drop(columns="Attrition_Flag")
Y = data["Attrition_Flag"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1,
)
print(X_train.shape, X_test.shape)
(7088, 20) (3039, 20)
# encode the target variable in train set
y_train.replace(
to_replace={"Attrited Customer": 1, "Existing Customer": 0}, inplace=True
)
# encode the target variable in test set
y_test.replace(
to_replace={"Attrited Customer": 1, "Existing Customer": 0}, inplace=True
)
# defining function to replace erroneous values with nan in features
def myProcessingSteps(df):
# Replacing the values in Income_Category column
df["Income_Category"].replace({"abc": np.nan}, inplace=True)
return df
# The function created for processing the data should be passed as an argument to the FunctionTransformer
# imports needed to build the pipeline below
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from imblearn.pipeline import Pipeline as ImbPipeline

processing = FunctionTransformer(myProcessingSteps)
# creating a list of numerical variables
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a list of categorical variables
categorical_features = [
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(sparse=False, handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
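A minimal sketch of what `handle_unknown="ignore"` buys us: a category never seen during fit is encoded as an all-zero row instead of raising an error at transform time (toy categories, for illustration only):

```python
from sklearn.preprocessing import OneHotEncoder

# fit on three known card categories (toy values)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["Blue"], ["Gold"], ["Silver"]])

# "Platinum" was never seen during fit, so it encodes to all zeros
row = enc.transform([["Platinum"]]).toarray()
print(row)  # [[0. 0. 0.]]
```

Without this setting, a new category in production data would crash the pipeline rather than degrade gracefully.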
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="drop",
)
# remainder = "drop" drops any variable that is present in the original data
# but not in numerical_features or categorical_features, e.g. CLIENTNUM
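A small sketch of the `remainder="drop"` behavior on a toy frame (hypothetical values), showing that an identifier column not listed in any transformer never reaches the model:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# toy frame: CLIENTNUM is an identifier the model should never see
toy = pd.DataFrame(
    {"CLIENTNUM": [708082083, 708083283], "Credit_Limit": [1438.3, 3418.0]}
)

ct = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Credit_Limit"])],
    remainder="drop",  # any column not listed above is silently dropped
)
out = ct.fit_transform(toy)
print(out.shape)  # (2, 1) -- only Credit_Limit survives
```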
# Creating new pipeline with best parameters
prod_pipe = ImbPipeline(
steps=[
("data_processing", processing),
("pre", preprocessor),
("undersample", RandomUnderSampler(random_state=1, sampling_strategy=1)),
(
"ABC",
AdaBoostClassifier(
random_state=1,
n_estimators=40,
learning_rate=0.5,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
),
),
]
)
# Display Pipeline
set_config(display="diagram")
# fit the model on training data
prod_pipe.fit(X_train, y_train)
Pipeline(steps=[('data_processing',
FunctionTransformer(func=<function myProcessingSteps at 0x0000018FB35F15E0>)),
('pre',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Cre...
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['Gender', 'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('undersample',
RandomUnderSampler(random_state=1, sampling_strategy=1)),
('ABC',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.5, n_estimators=40,
                                    random_state=1))])
# calculate metrics of the production pipeline on unseen data
model_performance_classification_sklearn(prod_pipe, X_test, y_test)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.947 | 0.941 | 0.777 | 0.851 |
The production pipeline has been created, and its scores are similar to those we obtained on unseen data during the model building stages.
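In practice, a fitted pipeline like this would be persisted to disk and reloaded in the serving environment. A minimal sketch using joblib (with a stand-in pipeline here so the snippet is self-contained; in our case the object saved would be the fitted `prod_pipe`):

```python
import os
import tempfile

import joblib
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# stand-in pipeline; in practice this would be the fitted prod_pipe
pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])

path = os.path.join(tempfile.gettempdir(), "prod_pipe.joblib")
joblib.dump(pipe, path)       # persist the pipeline to disk
restored = joblib.load(path)  # reload it for serving
print(type(restored).__name__)  # Pipeline
```

Because the preprocessing, undersampling, and classifier all live inside one object, saving and loading the pipeline keeps training-time and serving-time transformations in sync.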
We have extensively analysed the data and built several models based on decision trees, logistic regression, and ensemble techniques. Before we make recommendations on which model to use, let us look at a summary of the work we've done.
High number of outliers
Few outliers
These are the characteristics of customers who have attrited according to the card type they hold.
Customer Profile
Relationship Strategy
Credit & Balance
When a bank like Thera Bank is trying to contain customer attrition, the classification model must identify as many customers as possible who are likely to attrite. Customers who are predicted to stay but actually attrite pose a severe risk and loss to the bank. Therefore, it is imperative to reduce false negatives, and Recall is the metric of interest here.
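To make the metric choice concrete, here is a tiny hypothetical example (labels invented for illustration): a model that misses most attriters can still look acceptable on accuracy, but recall exposes the costly false negatives.

```python
from sklearn.metrics import accuracy_score, recall_score

# toy labels: 1 = attrited, 0 = stayed (hypothetical, for illustration only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # catches only 1 of 4 attriters

print(accuracy_score(y_true, y_pred))  # 0.7  -- looks tolerable
print(recall_score(y_true, y_pred))    # 0.25 -- three missed attriters
```

This is why the model selection and tuning above consistently filtered and scored on recall rather than accuracy.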
The pipeline was built in the following manner: